Localized spectro-temporal features for automatic speech recognition
Abstract
Recent results from physiological and psychoacoustic studies indicate that spectrally and temporally localized time-frequency envelope patterns form a relevant basis of auditory perception. This motivates new approaches to feature extraction for automatic speech recognition (ASR) which utilize two-dimensional spectro-temporal modulation filters. The paper provides a motivation and a brief overview of the work related to Localized Spectro-Temporal Features (LSTF). It further focuses on the Gabor feature approach, where a feature selection scheme is applied to automatically obtain a suitable set of Gabor-type features for a given task. The optimized feature sets are examined in ASR experiments with respect to robustness, and their statistical properties are analyzed.

1. Getting auditory ... again?

The question whether knowledge about the (human) auditory system provides valuable contributions to the design of ASR systems is as old as the field itself. The topic has been discussed extensively elsewhere (e.g. [1]). After all these years, a major argument still holds, namely the large gap in performance between normal-hearing native listeners and state-of-the-art ASR systems: consistently, humans outperform machines by at least an order of magnitude [2]. Human listeners recognize speech even in very adverse acoustical environments with strong reverberation and interfering sound sources. However, this discrepancy between human and machine performance is not restricted to robustness alone. It is observed also in undisturbed conditions and on very small, context-independent corpora, where higher-level constraints (cognitive aspects, language model) do not play a role. Arguably this hints at insufficient feature extraction in machine recognition systems. It is argued here that including LSTF streams provides another step towards human-like speech recognition.

2. Evidence for (spectro-)temporal processing in the auditory system

Speech is characterized by its fluctuations across time and frequency. The latter reflect the characteristics of the human vocal cords and vocal tract and are commonly exploited in ASR by using short-term spectral representations such as cepstral coefficients. The temporal properties of speech are targeted in ASR by dynamic (delta and delta-delta) features and by temporal filtering and feature extraction techniques such as RASTA [3] and TRAPS [4]. Nevertheless, speech clearly exhibits combined spectro-temporal modulations, due to intonation, co-articulation and the succession of several phonetic elements, e.g. in a syllable. Formant transitions, for example, result in diagonal patterns in a spectrogram representation of speech. This kind of pattern is captured by LSTF and explicitly targeted by the Gabor feature extraction method described below.

2.1. Neurophysiology

Recent findings from a number of physiological experiments in different mammalian species have revealed the spectro-temporal receptive fields (STRF) of neurons in the primary auditory cortex: individual neurons are sensitive to specific spectro-temporal patterns in the incoming sound signal. The results were obtained using reverse correlation techniques with complex spectro-temporal stimuli such as checkerboard noise or moving ripples [5]. The STRFs often clearly exceed one critical band in frequency, have multiple peaks and also show tuning to temporal modulation. In many cases the neurons are sensitive to the direction of spectro-temporal patterns (e.g. upward or downward moving ripples, cf. Fig. 1), which indicates combined spectro-temporal processing rather than consecutive stages of spectral and temporal filtering [6]. Still, the STRFs are mainly localized in time and frequency, generally spanning at most 250 ms and one or two octaves, respectively.
The center frequency distributions of the linear modulation filter transfer functions associated with the STRFs show a broad peak between 4 and 8 Hz in the ferret and at about 12 Hz in the cat [7]. In the visual cortex, STRFs are measured with (moving) oriented grating stimuli. The results match two-dimensional Gabor functions very well [8]. Often, two neurons show very similar STRFs differing only by a π/2 phase shift; two such cells combined provide for a translation-invariant detection of a given modulation pattern within a certain part of the visual field.

Figure 1: Example of a diagonal STRF from a neuron in the primary auditory cortex of a ferret. Courtesy of David J. Klein [6]. Red colour (+) denotes excitatory, blue (-) inhibitory regions in the receptive field.

2.2. Psychoacoustics and human speech perception

The neurophysiological findings fit well with psychoacoustic experiments on early auditory features [9]. A psychophysical reverse correlation technique was applied to analyze subjects' performance in masking experiments with semi-periodic white noise. The resulting basic auditory feature patterns are distributed in time and frequency and in some cases comprise several unconnected parts, very much resembling the STRFs of cortical neurons. In psychoacoustic modelling, temporal modulation filterbank approaches are becoming more and more accepted. The perception model (PEMO) of effective auditory processing, for example, utilizes a bank of modulation bandpass filters for each single critical band to account for a number of fundamental psychoacoustic experiments [10]. However, psychoacoustical phenomena such as comodulation masking release (CMR) indicate cross-channel mechanisms, and more recent models implicitly include localized spectro-temporal filters [11]. When Fletcher et al. examined the speech intelligibility of human listeners, they found the log sub-band classification error probability to be additive for nonsense syllable recognition tasks.
This suggests independent temporal processing in a number of articulatory bands. Their work resulted in the definition of the articulation index, a model of human speech perception [12]. However, recent speech intelligibility experiments have shown that the combination of two distant narrow spectral channels, or slits, leads to a gain in intelligibility which is greater than predicted by the articulation index (e.g. [13]). The new data suggest some integration of information across frequency bands. Other experiments, with artificially distorted modulation amplitude and phase in these bands, showed a relatively high tolerance of, e.g., phase distortions in speech intelligibility measurements [14]. This indicates at least partly channel-independent processing. The peripheral channels in these experiments were more than one octave apart, which rules out only global spectral integration of information and still allows for localized spectro-temporal features.

3. A brief history of temporal processing in automatic speech recognition

Standard front ends, such as mel-cepstra or perceptual linear prediction, only represent the spectrum within short analysis frames and thereby tend to neglect very important dynamic patterns in the speech signal. This deficiency has been partly overcome by adding temporal derivatives in the form of delta and delta-delta features to the feature set. Delta features effectively provide a comb filtering effect in the temporal modulation frequency domain. A number of different modulation filtering techniques in the cepstral or spectral domain have been developed since then. Depending on optional log amplitude compression, channel effects or additive noise can be reduced by temporal bandpass and highpass envelope filtering, such as cepstral mean subtraction, RASTA processing [3], the modulation spectrogram [15] or the adaptation loops of PEMO processing [16].
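As an illustration of the delta features mentioned above, the standard first-order regression over a short window of frames can be sketched as follows. This is a generic textbook formulation, not this paper's front end; the window half-width N = 2 and the edge-padding strategy are conventional choices assumed here for concreteness:

```python
import numpy as np

def delta(features, N=2):
    """First-order regression (delta) coefficients over +/-N frames.

    features: array of shape (num_frames, num_coeffs), e.g. cepstra.
    Edges are handled by repeating the first/last frame.
    """
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    deltas = np.zeros_like(features, dtype=float)
    for t in range(features.shape[0]):
        # regression slope of the trajectory around frame t
        deltas[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)
        ) / denom
    return deltas
```

For a coefficient trajectory that rises linearly by one unit per frame, the interior delta values come out as exactly that slope, which makes clear why deltas act as a (comb-like) filter emphasizing temporal change.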
The usefulness of modulation bandpass filtering for ASR has been studied in detail and matches well the importance of individual modulation frequency ranges for the speech intelligibility of human listeners [17]. New methods of purely temporal processing have been established, motivated by Fletcher's findings of independent processing in each frequency channel and by the focus on dynamic aspects of speech. The most prominent examples are the TempoRAl PatternS (TRAPS) [4], which apply multi-layer perceptrons to classify the current phoneme in each single critical band based on a temporal context of up to 1 s. Another approach is multi-band processing, for which features are calculated in broader sub-bands to reduce the effect of band-limited noise on the overall performance. However, all these feature extraction methods apply either spectral or temporal processing, but not both at a time.

4. A trend towards localized spectro-temporal features

Neurophysiological and psychoacoustic research yields auditory features of varying extent and shape which can be categorized as purely spectral, purely temporal or spectro-temporal. In the ASR domain, feature extraction methods have so far been dominated by approaches that are one-dimensional and of large extent in that dimension (such as a cepstral analysis over the whole spectrum at one point in time). Following the biological blueprint and adding localized, 2D spectro-temporal features yields several advantages:

Diagonality: LSTF very efficiently detect diagonal structures in spectro-temporal representations of speech, such as formant transitions.

Locality: the limited extent of LSTF fosters robustness in additive noise and mitigates channel effects.

Adaptivity: the size of each individual LSTF can be matched to the type of pattern it is designed for (in contrast to, e.g., high cepstral coefficients, which are calculated over the whole spectrum even when detecting two neighboring spectral peaks).
Generality: purely spectral and purely temporal LSTFs are akin to cepstral analysis and modulation bandpass filtering, respectively; the class of LSTFs therefore also includes existing types of feature extraction, with the restriction to localized processing.

There are a number of different approaches to spectro-temporal feature extraction for ASR, such as spectro-temporal modulation filtering [18], linear transformations of the spectrogram representation of speech, e.g. linear discriminant analysis (LDA), independent component analysis (ICA) and principal component analysis (PCA) [19], and the extension of TRAPS to more than one critical band [20]. Approaches that use artificial neural networks for ASR classify spectral features using temporal context on the order of 10 to 100 ms. Depending on the system, this is part of the back end, as in the connectionist approach [21], or part of the feature extraction, as in the Tandem system [22]. The main problem of LSTF is the large number of possible parameter combinations. This issue may be solved implicitly by automatic learning in neural networks with a spectrogram input and a long time window of, e.g., 1 s. However, this is computationally expensive and prone to overfitting, as it requires large amounts of (labeled) training data, which are often unavailable. By putting further constraints on the spectro-temporal patterns, the number of free parameters can be decreased by several orders of magnitude. This is the case when a specific analytical function, such as sigma-pi cells [23] or the Gabor function [24], is explicitly demanded. Such an approach narrows the search to a certain sub-set, and thereby some important features might be ignored; however, neurophysiological and psychoacoustic knowledge can be exploited for the choice of the prototype.
Another promising approach is to apply unsupervised machine learning techniques to derive a suitable set of features. Thus, a sparseness criterion was applied to derive optimal features from spectro-temporal speech data [25]. The resulting features showed a close resemblance to the STRFs of cortical neurons in the auditory system. Similar techniques are widely used in the visual domain.

Figure 2: Real part of a discrete two-dimensional complex Gabor function with parameters ωt/2π = 6 Hz and ωf/2π ≈ 0.55 cycl./oct., centered at f0 = 1195 Hz. Sampling as for a mel-spectrogram primary feature matrix.

5. Localized spectro-temporal Gabor features for automatic speech recognition

The STRFs of cortical neurons and the early auditory features derived in psychoacoustic experiments can be approximated, although somewhat simplified, by two-dimensional Gabor functions. The Gabor filter functions generally target two-dimensional envelope fluctuations but include, as special cases, purely spectral (local cepstra, modulo the windowing function) and purely temporal (modulation bandpass) features. The latter resemble TRAPS or the RASTA impulse response and its derivatives [1] in terms of temporal extent and filter shape. The use of Gabor features for ASR has been proposed earlier and proven to be relatively robust in combination with a simple classifier [24]. By applying a 'wrapper approach' to feature selection, optimized sets of Gabor features were obtained that also allowed for increased robustness in adverse acoustic conditions for digit strings in the Aurora 2 & 3 experimental setup [26]. These results were obtained by combining Gabor feature streams with other conventional feature streams.
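A two-dimensional Gabor function of the kind shown in Figure 2, i.e. a Gaussian envelope multiplied by a complex sinusoid over the time-frequency plane, can be sketched as follows. This is a minimal illustration rather than the paper's implementation; the frame rate (100 frames/s), the assumed channel density (about 8 mel channels per octave) and the envelope widths are assumptions chosen only to make the Figure 2 parameters concrete:

```python
import numpy as np

def gabor_real(n_t, n_f, omega_t, omega_f, sigma_t, sigma_f):
    """Real part of a complex 2D Gabor function sampled on an
    n_t x n_f (time-frame x frequency-channel) grid, centered in the grid.

    omega_t, omega_f: angular modulation frequencies in radians per
    frame and per channel; sigma_t, sigma_f: Gaussian envelope widths
    in frames and channels.
    """
    t = np.arange(n_t) - (n_t - 1) / 2.0   # time axis, centered
    f = np.arange(n_f) - (n_f - 1) / 2.0   # frequency axis, centered
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-T**2 / (2 * sigma_t**2) - F**2 / (2 * sigma_f**2))
    carrier = np.exp(1j * (omega_t * T + omega_f * F))
    return np.real(envelope * carrier)

# Figure 2 ballpark values: at 100 frames/s, a 6 Hz temporal modulation
# gives omega_t = 2*pi*6/100 rad/frame; assuming ~8 mel channels per
# octave, 0.55 cycl./oct. gives omega_f = 2*pi*0.55/8 rad/channel.
g = gabor_real(n_t=40, n_f=23,
               omega_t=2 * np.pi * 6 / 100,
               omega_f=2 * np.pi * 0.55 / 8,
               sigma_t=12.0, sigma_f=3.0)
```

In use, such a filter prototype would be correlated with a (log mel-)spectrogram, and the output at the channel corresponding to the filter's center frequency taken as one feature value per frame.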
Similar papers
Phoneme Classification Using Temporal Tracking of Speech Clusters in Spectro-temporal Domain
This article presents a new feature extraction technique based on the temporal tracking of clusters in the spectro-temporal feature space. In the proposed method, auditory cortical outputs were clustered, and the attributes of the speech clusters were extracted as secondary features. However, the shape and position of speech clusters change over time. The clusters were temporally tracked and temporal tra...
Spectro-temporal Gabor features as a front end for automatic speech recognition
A novel type of feature extraction is introduced to be used as a front end for automatic speech recognition (ASR). Two-dimensional Gabor filter functions are applied to a spectro-temporal representation formed by columns of primary feature vectors. The filter shape is motivated by recent findings in neurophysiology and psychoacoustics which revealed sensitivity towards complex spectro-temporal ...
Methods for capturing spectro-temporal modulations in automatic speech recognition
Psychoacoustical and neurophysiological results indicate that spectro-temporal modulations play an important role in sound perception. Speech signals, in particular, exhibit distinct spectro-temporal patterns which are well matched by receptive fields of cortical neurons. In order to improve the performance of automatic speech recognition (ASR) systems a number of different approaches are prese...
Spectro-temporal directional derivative features for automatic speech recognition
We introduce a novel spectro-temporal representation of speech by applying directional derivative filters to the Melspectrogram, with the aim of improving the robustness of automatic speech recognition. Previous studies have shown that two-dimensional wavelet functions, when tuned to appropriate spectral scales and temporal rates, are able to accurately capture the acoustic modulations of speec...
Robust Speech Recognition Based on Localized Spectro-temporal Features
In order to enhance automatic speech recognition performance in adverse conditions, localized spectro-temporal features (LSTF) are investigated, which are motivated by physiological measurements in the primary auditory cortex. In the Aurora2 experimental setup, Gabor-shaped LSTFs combined with a Tandem system yield robust performance with a feature set size of 30. If computational constraints a...
Multi-stream to many-stream: using spectro-temporal features for ASR
We report progress in the use of multi-stream spectro-temporal features for both small and large vocabulary automatic speech recognition tasks. In this approach, features are divided into multiple streams for parallel processing and dynamic utilization. For small vocabulary speech recognition experiments, the incorporation of up to 28 dynamically-weighted spectro-temporal feature streams along w...